The purpose of the research reported in this paper was to address three
practical questions of importance and interest to test developers:

1) What are the effects of examinee sample size and test length on
the precision of SEE Curves?

2) What effects do the statistical characteristics of an item pool
have on the precision of SEE Curves?

3) What is the relationship between test length and SEE Curves in
typical item pools?

The results of this study indicated that: (1) both test length and
sample size are extremely important factors in the precision of SEE Curves,
(2) the precision of SEE Curves at the extremes of an ability continuum
is very poor, even with large examinee samples, (3) the precision of SEE
Curves would be acceptable in most cases if the curves are based on
200 or more examinees with tests with at least 20 items, and (4) the most
sizable improvements in the precision of SEE Curves occur when examinee
sample size is increased from 50 to 200 and when test length is increased
from 10 to 20 items.

There have been a number of highly successful applications of latent
trait models in the last couple of years. Reviews of many of these
applications are provided by Hambleton, Swaminathan, Cook, Eignor, and
Gifford (1979), Pentz and Rentz (1978), and Weiss (1978). The one-,
two-, and three-parameter logistic latent trait models have been used by
measurement specialists to solve problems in the areas of tailored testing
(Weiss, 1978), test score equating (Lord, 1977, in press; Marco, 1977;
Rentz & Bashaw, 1977), test development (Wright & Stone, 1Q7S), and item
bias (Lord, in press). In fact, the applications cited, and others,
have been so successful that the discussions about the use of latent trait
models have shifted from a consideration of the potential of latent trait
models relative to classical models, to a consideration of (1) which latent
trait models should be used with particular measurement problems and
(2) technical problems (e.g., parameter estimation and goodness of fit
measures) arising in connection with the application of particular latent
trait models.

This paper was prepared to report some of our recent work in using
the three-parameter logistic model in test development. One of the
features of using any latent trait model is the possibility of specifying
a "target information curve" and then selecting test items from an item
pool to produce a test with the features characterized by the "target
information curve." A target information curve describes the desired
level of "information" at each point on the ability scale underlying
examinee test performance. Information, in turn, is directly related
to the degree of precision of ability estimates at different points
on the ability continuum. In fact, as long as a test is not too short,
the standard error of estimation at a particular ability level is equal
to one divided by the square root of the information provided by the test
at the ability level in question (SEE(θ) = 1/√information(θ)). In
practice, since the contribution of each test item to the test information
curve (referred to as a "score information curve" when item parameter
estimates are used instead of the item parameter values) is known (once
the item parameter values or the item parameter estimates are specified),
it is possible to select test items from a pool of "calibrated" test items
(i.e., a pool of test items with associated parameter estimates) to
produce a "score information curve" which approximates a desired "target
information curve." With the three-parameter logistic model, items are
described by three parameters, referred to as "item difficulty," "item
discrimination," and "item pseudo-chance level" (Hambleton et al., 1979).

One of the problems with the paradigm offered above for test
development is the imprecision associated with the item parameter estimates.
Score information curves (and therefore the associated standard errors
of ability estimates) will depend on the precision of item parameter
estimates. In turn, precision of item parameter estimates is influenced
by the examinee sample size used to estimate the item parameters, and
in the case of the item discrimination parameter, estimates are influenced
by the length of the test. This study was designed to address three
practical questions which are of some importance and interest to test
developers:

1.  What are the effects of examinee sample size and test
    length on the precision of standard error of ability
    estimation curves?

2.  What effects do the statistical characteristics of an item
    pool have on the precision of standard error of ability
    estimation curves?

3.  What is the relationship between test length and standard
    error of ability estimation curves in typical item
    pools?

A computer simulation study was chosen as the mode of investigation
for the three questions because of the large number of variables which
were to be studied, and the need to "know," in some instances, the values
of the item parameters.

The remainder of the paper is divided into four sections: (1)
Background on Item and Score Information Curves, (2) Method of
Investigation, (3) Results, and (4) Conclusions.

Background  on  Item  and  Score  Information  Curves 

Once a latent-trait model is specified, the precision with which
it estimates examinee ability can be determined. Birnbaum (1968) defined
the notion of information as a quantity inversely proportional to the
squared length of the confidence interval around an estimate of an
examinee's ability. The standard error of ability estimation is equal
to 1/√information. When information at an ability level is high, narrow
confidence bands around the estimates result. If information is low,
wider confidence bands are obtained. Because the test information curve
varies with ability level, it has been suggested that test information
curves ought to replace the use of classical reliability estimates and
standard errors of measurement in test score interpretations.

In mathematical terms, Lord (in press) gives the test information
curve by

\[
I(\theta) = \sum_{g=1}^{n} \frac{(P_g')^2}{P_g Q_g} \qquad [1]
\]

and the standard error of estimation curve by

\[
SEE(\theta) = \frac{1}{\sqrt{I(\theta)}} \qquad [2]
\]

In the expressions above, I(θ) is the amount of information at ability
level θ; SEE(θ) is the degree of precision of an ability estimate at
ability level θ; P_g is the probability of a correct answer to item g by
an examinee with ability level θ; Q_g is equal to 1 - P_g; and P'_g is the
slope of the item characteristic curve at ability level θ. When item
parameter estimates are used in Equation [1], Lord (in press) substitutes
the term "score information curve" for "test information curve."
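
Equations [1] and [2] can be evaluated directly once item parameters (or
their estimates) are in hand. The sketch below is illustrative only: it
assumes the usual three-parameter logistic form with a scaling constant
D = 1.7 (the paper does not state one), and the three items and the function
names are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant; the paper does not state one


def p_correct(theta, a, b, c):
    """3PL probability of a correct answer at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_slope(theta, a, b, c):
    """Slope of the item characteristic curve (dP/dtheta) at theta."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (1.0 - c) * D * a * logistic * (1.0 - logistic)


def information(theta, items):
    """Equation [1]: sum over items of P'^2 / (P * Q)."""
    total = 0.0
    for a, b, c in items:
        p = p_correct(theta, a, b, c)
        total += p_slope(theta, a, b, c) ** 2 / (p * (1.0 - p))
    return total


def see(theta, items):
    """Equation [2]: SEE(theta) = 1 / sqrt(I(theta))."""
    return 1.0 / math.sqrt(information(theta, items))


# Hypothetical three-item test; each tuple is (a, b, c).
items = [(1.0, -1.0, 0.25), (1.2, 0.0, 0.25), (0.8, 1.0, 0.25)]
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(see(theta, items), 3))
```

Because information is additive over items, a test made of four copies of
these items has four times the information and half the SEE at every
ability level.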

The quantity (P'_g)^2 / (P_g Q_g) is the contribution of item g to the
information curve of the test and is referred to as the item information
curve. Item information curves have an important role in determining
the accuracy with which ability is estimated at different levels of θ.
Each item information curve depends on the slope of the particular item
characteristic curve and the conditional variance of test scores at each
ability level θ. The higher the slope of the item characteristic curve
and the smaller the conditional variance, the higher will be the item
information curve at that particular ability level. The height of the
item information curve at a particular ability level is a direct measure
of the usefulness of the item for precisely measuring ability at that
level.
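
Under the three-parameter logistic model the item information quantity has
a closed form. The short derivation below is a sketch and is not taken from
the paper; it assumes the usual parameterization with a scaling constant D
(commonly 1.7) and item parameters a_g, b_g, and c_g.

```latex
\begin{align*}
P_g(\theta) &= c_g + (1 - c_g)\,L_g(\theta), \qquad
L_g(\theta) = \frac{1}{1 + e^{-D a_g (\theta - b_g)}} \\
P_g'(\theta) &= (1 - c_g)\, D a_g\, L_g (1 - L_g)
             = D a_g\, \frac{(P_g - c_g)\, Q_g}{1 - c_g} \\
\frac{(P_g')^2}{P_g Q_g} &= D^2 a_g^2\, \frac{Q_g}{P_g}
             \left( \frac{P_g - c_g}{1 - c_g} \right)^2
\end{align*}
```

The second line uses L_g = (P_g - c_g)/(1 - c_g) and 1 - L_g = Q_g/(1 - c_g).
The result shows that item information grows with the square of the
discrimination and shrinks as the pseudo-chance level rises.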

Method of Investigation

Description  of  the  Variables 

(a)  Test  Length 

Tests  of  three  lengths  were  considered:  10,  20,  and  80  items. 

A test with 10 items is about as short a test as is used in practice,
and therefore the 10-item test length was studied. An 80-item test was
considered because that length represents about as long a test as is used
in practice.

(b)  Ability  Distribution 

In this particular study, ability scores were simulated to be
normally distributed (mean = 0, sd = 1). This assumption was made to
conform with a very important assumption made in the item parameter
estimation method selected for the study (Urry, 1974). Actually, the
parameter estimation method used is a slight modification of the one
Urry reported in his 1974 paper. He refers to this new method as the
"ancillary estimation method." Urry's method was chosen for the study
because (1) the method has been extensively used and found to give
acceptable results and (2) Urry's computer program is inexpensive.

(c)  Sample  Size 

Three examinee sample sizes were chosen: 50, 200, and 1000. The
smallest sample size (N=50) is considerably smaller than anyone should
use in practice. It was chosen to identify the "worst possible" results
that could be expected. The other two sample sizes define minimum and
maximum sample sizes typically used in test development work with latent
trait models.

(d) Item Pools

Ranges of parameter values for items in the two pools are shown
below:

                              Range of Values
Item Parameter          Pool One           Pool Two

Difficulty (b)          -2.00 to 2.00      -1.00 to 1.00
Discrimination (a)        .60 to 2.00        .60 to 1.50
Pseudo-Chance (c)         .25 to  .25        .25 to  .25

The differences between the two item pools can be described as follows:
Items in pool one had a wider range of difficulty and discrimination
values.


Simulation  of  Data 


The  eight  steps  in  the  simulation  study  were  as  follows: 

1.  Item pool one was selected for study.

2.  A test length (10, 20, or 80 items) and a sample size (50,
    200, or 1000 examinees) were selected. A sample of examinee
    ability scores was drawn from a normal distribution (mean = 0,
    sd = 1).

3.  Using a computer program, DATAGEN (Hambleton & Rovinelli, 1973),
    (1) item parameters, given the constraints of the item pool
    under investigation, and (2) examinee item scores were produced.
    The computer program assumed the correctness of the three-
    parameter logistic model and used the ability scores from step 2
    and the item parameters generated at this step to produce
    probabilities of correct answers for examinees to the test items.
    These probabilities, in turn, were converted to examinee item
    scores (0 or 1) via the use of a random number generator.

4.  The examinee item scores from step 3 were used in Urry's
    computer program to estimate item and ability parameters.
    However, only the item parameter estimates were used further
    in this particular study.

5.  The item parameter estimates were used in Equation 2 to
    obtain SEE(θ). The value of SEE(θ) at seven ability levels
    (θ = -3.00, -2.00, -1.00, 0.00, 1.00, 2.00, 3.00) was
    calculated.

6.  Steps 3 to 5 were repeated three times to obtain three
    estimates of SEE(θ). All item and ability parameter values for the
    three runs were identical. The particular examinee item
    scores varied from one run to the next because of the
    probabilistic nature of the score outcomes.

7.  Steps 3 to 6 were repeated for each combination of test length
    and sample size (3x3=9).

8.  Steps 2 to 7 were repeated with the second item pool. In all,
    54 sets of test data were considered in the study.
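
The data-generation part of this procedure (steps 2, 3, and 6) can be
sketched as follows. This is not DATAGEN, and step 4 (parameter estimation
with Urry's program) is not reproduced. Uniform sampling of item parameters
within the pool ranges, the scaling constant D = 1.7, the seed, and all
function names are assumptions made for illustration.

```python
import math
import random

D = 1.7  # assumed scaling constant, not stated in the paper

# Parameter ranges from the item pool table: (low, high) for b, a, c.
POOLS = {
    "one": {"b": (-2.0, 2.0), "a": (0.6, 2.0), "c": (0.25, 0.25)},
    "two": {"b": (-1.0, 1.0), "a": (0.6, 1.5), "c": (0.25, 0.25)},
}


def draw_items(pool, n_items, rng):
    """Draw (a, b, c) per item; uniform sampling is an assumption."""
    r = POOLS[pool]
    return [(rng.uniform(*r["a"]), rng.uniform(*r["b"]), rng.uniform(*r["c"]))
            for _ in range(n_items)]


def draw_abilities(n_examinees, rng):
    """Step 2: abilities from a normal distribution (mean 0, sd 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]


def simulate_scores(abilities, items, rng):
    """Step 3: turn 3PL probabilities into 0/1 item scores."""
    scores = []
    for theta in abilities:
        row = []
        for a, b, c in items:
            p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        scores.append(row)
    return scores


rng = random.Random(1979)  # arbitrary seed, for repeatability
items = draw_items("one", 20, rng)
abilities = draw_abilities(200, rng)
# Three replications share items and abilities; only the scores differ.
replications = [simulate_scores(abilities, items, rng) for _ in range(3)]
print(len(replications), len(replications[0]), len(replications[0][0]))
```

Each replication would then be passed to an item parameter estimation
routine, and the resulting estimates used to compute SEE(θ) at the seven
ability levels.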


Results 

Effects of Sample Size and Test Length
on the Precision of Standard Error of
Ability Estimation Curves

In the remainder of this paper "Standard Error of Ability Estimation
Curves" will be referred to as "SEE Curves" for convenience.

Tables 1 to 6 contain the SEE Curves with Item Pool One obtained
for three replications of three examinee sample sizes (N=50, 200, 1000)
and three test lengths (n=10, 20, 80) and reported for seven ability
levels. Tables 1 to 3 and Tables 4 to 6 contain the same information. What
differs is the way the data are organized in the two sets of tables.
Data have been arranged in Tables 1-3 to facilitate an examination of
the effect of sample size on SEE Curves. The data presented in
Tables 4-6 have been arranged to facilitate an examination of the
effect of test length on SEE Curves. Test lengths and sample sizes
given under the column headed "actual" are the number of items and
examinees remaining after a satisfactory set of item and ability
parameter estimates was obtained from Urry's computer program.


Summary of Standard Error Estimates for Various Sample Sizes
and Ability Levels with a Heterogeneous Item Pool
(Test Length = 10 Items)


Summary of Standard Error Estimates for Various Sample Sizes
and Ability Levels with a Heterogeneous Item Pool
(Test Length = 20 Items)

For ease of interpretation, the same data reported in Tables
1 to 6 are presented in graphical form in Figure 1.

Tables  7  to  12  contain  similar  data  to  Tables  1  to  6.  Tables 
7  to  12  contain  SEE  Curves  with  Item  Pool  Two.  (There  is  no  figure, 
however,  corresponding  to  Figure  1  for  Item  Pool  Two.)  Tables  13  and 
14  were  constructed  to  organize  the  data  reported  in  Tables  1  to  12  to 
facilitate  the  interpretation  of  results. 

(a)  Item  Pool  One — Effect  of  Sample  Size 
The  results  of  the  simulations  for  a  fixed  test  length  of  10  items, 
which  are  reported  in  Table  1,  clearly  show  the  lack  of  stability  of  the 
SEE  Curves  for  all  sample  sizes.  There  was  little  improvement,  if 
any,  due  to  increasing  sample  size.  This  result,  however,  may  be  due 
to  the  limited  amount  of  data  considered  since  improvements  were  obtained 
in  Item  Pool  Two  and  at  other  test  lengths. 

From  examination  of  Table  2,  which  contains  the  results  of  the  20 
item  simulations,  it  is  apparent  that  the  SEE  Curves  were  beginning  to 
stabilize.  Except  at  extreme  values  of  the  ability  continuum  the  results 
were  nearly  as  good  as  those  obtained  with  the  larger  sample  size  (N=1000) . 

At a test length of 80 items, Table 3 clearly shows that SEE Curves
are highly stable. Similar to the effect noted with test lengths of 20,
the expected decrease in variation of the standard errors with increase
in sample size is apparent only at ability levels of -1, +1, and +2.


[Figure legend: 10-item test, 20-item test, 80-item test]


Summary  of  Standard  Error  Estimates  for  Various  Sample  Sizes 
and  Ability  Levels  with  a  Homogeneous  Item  Pool 
(Test  Length  =  10  Items) 

(b) Item Pool One — Effect of Test Length

Examination of the results reported in Table 4 indicates that,
for samples of size 50, as test length increased, variation in the
SEE Curves decreased at all ability levels.

Tables 5 and 6, which represent the results of the simulations
for sample sizes of 200 and 1000, clearly show the following trends:
(1) the most stable SEE Curves were obtained for the longest test length;
and (2) for all ability levels, variation in the SEE Curves decreased as
test length increased.

Table 13 presents a summary of the data found in Tables 1-6.
Entries in this table are the standard deviations of the standard errors
of estimate obtained across the three replications of the various studies.
Standard deviations are reported for each test length-sample size
combination across five ability levels. Also included in Table 13 is the
average of the standard deviations across ability levels for each test
length-sample size combination. It is this latter value that is the focus
of the following discussion.
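
The summary statistic described here can be sketched as below. The SEE
values are hypothetical and are NOT taken from the paper's tables, the five
ability levels are assumed to run from -2 to +2, and the n - 1 form of the
standard deviation is an assumption, since the paper does not say which
form was used.

```python
import statistics

# Hypothetical SEE values, NOT taken from the paper's tables:
# one row per replication, one column per ability level.
ability_levels = [-2.0, -1.0, 0.0, 1.0, 2.0]
see_by_replication = [
    [0.90, 0.50, 0.40, 0.45, 0.70],
    [1.10, 0.55, 0.42, 0.50, 0.85],
    [0.80, 0.45, 0.38, 0.40, 0.65],
]


def sd_across_replications(rows):
    """SD of SEE at each ability level across replications (n - 1 form)."""
    return [statistics.stdev(column) for column in zip(*rows)]


def average_variation(rows):
    """Mean of the per-level standard deviations."""
    return statistics.mean(sd_across_replications(rows))


sds = sd_across_replications(see_by_replication)
for theta, sd in zip(ability_levels, sds):
    print(theta, round(sd, 3))
print("average", round(average_variation(see_by_replication), 3))
```

A smaller average indicates SEE Curves that change less from one
replication to the next for that test length-sample size combination.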

Several trends are apparent from examination of the average variation
of standard errors: (1) the variation decreased as test length increased
for all sample sizes, (2) when test length was fixed at 10 items, sample
size had little or no effect on the stability of the SEE Curves, and (3)
at the longer test lengths, sample size generally had a noticeable effect
on the stability of the SEE Curves.

Figure 1 contains three graphs illustrating the effect of test
length and sample size on the stability of the SEE Curves at five ability
levels. Each graph represents a plot of the values of the SEE Curves
obtained when sample size was held constant and test length was varied.
It is clear, from examination of these graphs, that sample size has little
effect on the stability of SEE Curves of short tests (n=10). The effect
of sample size on the stability of the standard errors was most apparent
for the intermediate length test (n=20). For a long test (n=80) sample
size showed the most pronounced effect when there was an increase from
50 to 200 examinees. An effect was also noticed when sample size was
increased from 200 to 1000 examinees; however, the improvements in
precision were more modest in size.

(c) Item Pool Two — Effect of Sample Size

Table 7 presents the results of the simulations involving test
lengths of 10 items. It should be noted that no values are reported for
ability level -3 and also that the only complete set of values at ability
level -2 is reported for a sample size of 200. Values obtained at these
ability levels fluctuated greatly and so they are not reported (a similar
explanation applies to other results not reported). In summary, there
was a substantial improvement in the precision of SEE Curves for
increasing sample sizes. In fact, the improvements in precision of SEE
Curves due to sample size for test lengths of 20 and 80 items are also
clear from a study of Tables 8 and 9.

(d)  Item  Pool  Two — Effect  of  Test  Length 

The results of this investigation are reported in Tables 10-12.
These results are very similar to those obtained for item pool one and
therefore will not be discussed to any great extent. It is important
to note that for all sample sizes and at all ability levels there appears
to be a fairly consistent tendency for the stability of the SEE Curves to
increase as test length was increased.

Table 14 summarizes the results reported in Tables 7-12. Data
are arranged in Table 14 in the same manner in which they were arranged
in Table 13. Examination of the average variation across ability
levels indicates that for all test lengths, sample size has a
noticeable effect on the stability of the SEE Curves. In comparison to the
results reported in Table 13, the effect of test length on the average
variation across ability levels is not so apparent. The reason for
this is the smaller variation observed for short tests with this
particular item pool.

Effects of Statistical Characteristics
of an Item Pool on Precision of SEE Curves

A comparison of the results reported in Tables 13 and 14 indicated
that for tests of 20 and 80 items, the variation in the SEE Curves,
averaged across ability levels, is very similar for both item pools.
For test lengths of 10, the situation is quite different. In order to
make the average variations across ability levels at this test length
comparable for both item pools, these values were recomputed for item
pool two, excluding the values obtained for ability level of -2. The
recomputed average variation values are .33, .38, and .52 for sample
sizes of 50, 200 and 1000 respectively. It is clear that, for short
tests, the homogeneous item pool (pool two) resulted in smaller average
variations than did the heterogeneous item pool. A second point worth
noting is that the heterogeneous item pool (pool one) provided more
stable standard errors at an ability of -2 for test lengths of 10 or
20 items than did the homogeneous item pool. For test lengths of 80,
the results appear to be about the same for both item pools. It
should also be noted that the homogeneous item pool generally resulted in
greater stability of standard errors for ability levels between +1 and
-1 than did the heterogeneous item pool.

Relationship  Between  Test  Length  and  SEE  Curves 
in  Two  Typical  Item  Pools 

Figure 2 contains two graphs, representing item pools one and two.
These graphs show the relationship between test length and SEE Curves.
Item parameters were used to derive the Curves rather than estimates of
the item parameters. The trends in the results are generally what one
would expect. The value of the figure is the information it provides to
test developers who must determine a test length.
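
The kind of comparison shown in Figure 2 can be sketched by computing SEE
Curves from known item parameters for each test length. The paper's actual
item samples are not available, so the example below assumes hypothetical
tests with difficulties evenly spaced over the pool one range, a common
discrimination of 1.0, c = .25, and a scaling constant D = 1.7.

```python
import math

D = 1.7  # assumed scaling constant


def item_information(theta, a, b, c):
    """3PL item information, P'^2 / (P * Q)."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    slope = (1.0 - c) * D * a * logistic * (1.0 - logistic)
    return slope ** 2 / (p * (1.0 - p))


def make_test(n_items, b_low, b_high, a=1.0, c=0.25):
    """Hypothetical test: difficulties evenly spaced over [b_low, b_high]."""
    step = (b_high - b_low) / (n_items - 1)
    return [(a, b_low + i * step, c) for i in range(n_items)]


def see_curve(items, thetas):
    """SEE(theta) = 1 / sqrt(sum of item information) at each theta."""
    return [1.0 / math.sqrt(sum(item_information(t, *it) for it in items))
            for t in thetas]


thetas = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
curves = {n: see_curve(make_test(n, -2.0, 2.0), thetas) for n in (10, 20, 80)}
for n, curve in curves.items():
    print(n, [round(v, 2) for v in curve])
```

Under these assumptions the standard errors shrink at every ability level
as the test is lengthened, and they are larger at the extremes than at the
center of the ability scale.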

Tests of 10 and 20 items drawn from the heterogeneous item
pool (item pool one) do not show the expected U-shaped pattern exhibited
by the curves obtained for these test lengths when the simulation involved
a homogeneous item pool. The "humping" effect noted at the center of
the ability distribution is due to the particular sample of items chosen:
somewhat fewer items were selected with difficulty values close to zero.
It is quite apparent that the heterogeneous item pool provided smaller
standard errors across a wider range of abilities than did the
homogeneous item pool.

Further insight into the effect of the item pool on the size of
the standard errors can be obtained by examination of the graphs presented
in Figure 3. Each graph represents one of the three different test
lengths that were studied. The SEE at ability levels between -3 and +3
is graphed for both item pools on the same axes to
facilitate comparison of the effect of the item pools. The decrease in
the size of the standard errors as test length increases is quite evident
for both pools. Also apparent is the fact that tests based on items
drawn from the heterogeneous item pool provide greater precision over a
wider ability range than do tests developed from the homogeneous item
pool.


Conclusions 

A study along the general lines of this one is not going to reveal
any major new results. It is well-known that the size of an examinee
sample, the length of a test, and the characteristics of an item pool
will have an important influence on the shape and stability of SEE
Curves. The importance of this study is that it provides data
concerning the size of improvements in SEE Curves relative to the three
factors under investigation: (1) sample size, (2) test length, and (3) item
pool characteristics. In this regard several conclusions seem warranted:

1.  Both test length and sample size are extremely important
    factors in the precision of SEE Curves. (There were a
    small number of reversals in the results; no doubt this
    was due to sampling fluctuations.)

2.  Precision of SEE Curves at the extremes of an ability
    continuum is very poor, even with large examinee sample
    sizes. The results are substantially better when tests
    are lengthened, even if the sample size is small (N=50).

3.  The precision of SEE Curves would be acceptable in most
    instances if the Curves are based on 200 or more examinees
    with tests with at least 20 items. This recommendation
    holds if primary concern is with values of the Curves
    in middle regions of the ability continuum [-1 to +1].

4.  Increases in examinee sample sizes from 50 to 200
    produce sizable improvements in the precision of SEE
    Curves. Increasing the sample size from 200 to 1000
    produces only modest further gains in the precision of
    the SEE Curves.

5.  Similarly for test lengths, improvements in precision were
    substantially greater when the change was from 10 to 20
    items than from 20 to 80 items.

Perhaps by offering a practical testing problem that arises, we
can explain our interest in the precision of SEE Curves. Suppose a
test developer selects a set of test items from a pool of items for
a particular test he or she desires to build. Item selection is usually
based on the item statistics. This test developer may then calculate
the "expected" score information curve and corresponding SEE Curve. The
usefulness of a SEE Curve will depend on its precision. If we knew
that a second administration of the test to a similar group of examinees
would produce a radically different curve, the curve would be of little
or no value. The results of our study suggest that if an item pool is
"typical," the stability of SEE Curves across readministrations of the
test to similar groups of examinees will be quite good if the test
includes at least 20 items, and if 200 or more examinees are used in
deriving the item statistics.

We  hope  that  our  research  has  provided  at  least  a  few  guidelines 
to  aid  test  developers  in  determining  the  confidence  which  they  should  have  in 
SEE  Curves  that  arise  in  their  work.  If  it  also  serves  as  a  motivator 
to  further  extend  our  work  by  considering  other  aspects  of  the  problem 
(for  example,  the  shape  of  the  underlying  ability  distribution,  the 
number  of  parameters  describing  a  test  item,  and  methods  used  to  estimate 
parameters)  we  will  be  even  more  pleased. 